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Abstract 

With the progress of measurement apparatus and the development of automatic sen- 
sors it is not unusual anymore to get large samples of observations taking values in high 
dimension spaces such as functional spaces. In such large samples of high dimensional 
data, outlying curves may not be imcommon and even a few individuals may corrupt sim- 
ple statistical indicators such as the mean trajectory. We focus here on the estimation of 
the geometric median which is a direct generalization of the real median in metric spaces 
and has nice robustness properties. The geometric median being defined as the minimizer 
of a simple convex functional that is differentiable everywhere when the distribution has 
no atom, it is possible to estimate it with online gradient algorithms. Such algorithms are 
very fast and can deal with large samples. Furthermore they also can be simply updated 
when the data arrive sequentially. We state the almost sure consistency and the rates 
of convergence of the stochastic gradient estimator as well as the asymptotic normality of 
its averaged version. We get that the asymptotic distribution of the averaged version of 
the algorithm is the same as the classic estimators which are based on the minimization of 
the empirical loss function. The performances of our averaged sequential estimator, both 
in terms of computation speed and accuracy of the estimations, are evaluated with a small 
simulation study. Our approach is also illustrated on a sample of more than 5000 individual 
television audiences measured every second over a period of 24 hours. 

Keywords. CLT, functiorial data, geometric quantiles, high dimension, L^-median, online al- 
gorithms, recursive estimation, Robbins-Monro algorithm, spatial median. 
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1 Introduction 



With the progress of measuremerit apparatus, the development of automatic sensors and the 
increasing storage performances of comp uters it is no t unusu al anymore to get large samples 



of functional observations. For example 



Cardotetal 



(|2010al ) analyze a sample of more than 



18000 electricity consumption curves measured every half hour over a period of two weeks. 
Our study is motivated by the estimation of the central point of a sample of n = 5423 vectors 
of IR*^, with d = 86400, which correspond to individual television audiences measured every 
second over a period of 24 hours. 

In such large samples of high dimensional data, outlying curves may not be uncommon and 
a even few individuals ma y corrupt simp le statistical indicators such as the mean trajectory or 
the principal components jGervinil (|2008n ). Detecting these atypical curves automatically is not 
straightforward in such a high dimensional and large sample context and considering directly 
robust techniques is an inte r esting alternative. There are many robust location indicators in the 
multivariate setting (|Smalll (|l990l )) but most of them require high computational efforts to be 
estimated, even for small sample sizes, when the dimension is relatively large. For example. 



Fraiman and MunizI (|200l[) have extended the notion of trimmed means to a functional context 



in order to get robust estimators of the me an profile. In order to deal with the dimensionality 



issue and to reduce the computation time. 



Cuevas et al 



(|2007n have proposed random projec- 



tion techniques in the context of maximal depth estimators and studied their properties via 
simulation studies. Note that sub-sampling approaches based on survey sampling with un- 
equal probability sampli ng designs have also beeri proposed in the literature in order to reduce 
the computational time (|Chaouch and Gogal (|2010l) ). 

We focus here on the geometric median, also called L^-median or sp atial median, which 
is a direct generalization of the real median proposed by iHaldand (|l948l) and whose proper- 
ties have been studied in details by iKempermanI (|l987l) . It can be defined even if the random 
variable does not have a finite first order mome nt and it has nice robustness properties since 
its breakdown point is equal to 0.5. As noted in ISmalll ([19901), one drawback of the geometric 
median is that it is not affine equivariant. Nevertheless, it is invariant to translation and scale 
changes and thus is well adapted to functional data which are observed with the same units 
at each instant of time. In a functiori a l conte x t, con siste nt estimators o f the L^-median have 
been proposed by Kemperman ( 1987), Icadre ( 2001) and Gervini ( 2008 ). Iterative estimation 
algorithms hav e been develop ed by Goweij ( 1974 ), Vardi and Zhang (2000) in the multivariate 



setting and by 



Gervini 



too 



for functional data. This latter algorithm requires to invert at 
each step matrices whose dimension is equal to the dimensio n d of the data and thus requires 
important computational efforts. The algorithm proposed by lVardi and Zhang i200ol) is much 
faster and only requires 0{nd) operations at each iteration, where n is the sample size. Nev- 
ertheless, these estimation procedures are not adapted when the data arrive sequentially, they 
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need to store all the data and they cannot be simply updated. 

In this paper, we explore another direction. The geometric median being defined as the 
minimizer of a simple functional that is differentiable everywhere when the distribution has 
no atom, it is possible to estimate it with online gradient algorithms. Such algorithms are 
very fast and can be simply updated when the data arrive sequentially. There is a vast litera- 
ture on stochastic g r adien t algorithrns whi c h mainly focus on finite d imensional situati o ns (see 



Kushner and ClarkI (Il978h iRupperj (Il985n. 



Kushner and Yin 



Benveniste et al. 



\l99d ). Ljung et al.l il992l) 



Duflo 



Arnaudon et al 



( 20031) , iBottoul (|201d ) in the multivariate case, and 
(|2010l ) on manifolds). The literature is much less abundant when one has to consider on- 
line observations taking values in a functional space (usually an inf i nite d i mensional Banach 
or Hi l bert space) and most works focus on linear algorithms (|Walkl (|l977l ). iDippon and Walk 
( 2006l ). lsmaleand Yaol jlOOdi )). 

It is also known in the multivariate setting that averaging procedures can lead to efficient es- 
timation procedure under additional assumptior is on the noise and when th e target is define d 
as the minimizer of a strictly convex function (|Polyak and Tuditskyi (|l992l) , iPelletieii (|2000l) ). 
There is little work on averaging when considering random variabl es taking values in Hilbe rt 
spaces and, as far as we know, they only deal with linear algorithms (jPippon and Walkl (|2006l) ). 
Nevertheless, it has been note d in an empirical stu dy whose aim was to estimate the geometric 



median with functional data (Cardot et al 



(|2010a[) ) that averaging could improve in an impor- 
tant way the accuracy of the estimators. 

The paper is organized as follows. We first fix notations, give some properties of the geo- 
metric median and present our stochastic gradient algorithm as well as its averaged version. 
W e also note that ou r study extends directly to the estimation of geometric quantiles defined 
by IChaudhuril (|l996l) . In a third section we state the almost sure consistency and the rates 
of convergence of the stochastic gradient estimators as well as the asymptotic normality of its 
averaged version. We get that the asymptotic distribution of the averaged version of the algo- 
rithm is the same as the classic estimators. A fourth section is devoted to a small simulation 
study which aims at comparing the p erformances of our estimator with the static algorithm 
developed by IVardi and ZhangI (|2000[) . The comparison is performed according to two points 
of view, for the same sample size and for the same computation time. We also analyze a real 
example with a large sample of individual television audiences measured every second over a 
period of 24 hours. The proofs are gathered in Section|6l 
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2 The algorithms and some properties of the geometric median 
2.1 Definitions and assumptions 

Let H be a separable Hilbert space such as R'^ or L^{I), for some closed interval 7 C K. We 
denote by (., .) its inner product and by || • || the associated norm. 

The geometric median m of a random variable X taking values in H is defined by (see 
KempermanI (|l987l) ): 



m 



argminE[||X-H|| - ||X| 



UGH 



(1) 



Note that this general definition ((TJ does not assume the existence of the first order moment of 
||X|| . We suppose from now on that the following assumptions are fulfilled. 

Al. The random variable X is not concentrated on a straight line: for all v ^ H, there iszv ^ H 
such that {v,w) =0 and 

\ar{{iv,X)) > 0. 

A2. The law of X is a mixing of two "nice" distributions : fix = ^V-c + (1 ~ ^)}'^di where 

- He is not strongly concentrated around single points: if B{Q,A) is the ball {a G 
H, II a II < A), and Y is a random variable with law }ic, 



VA,3Ca G [0,oo),Va G 5(0,A), E 



|Y- a| 



1-1 



< Ca. 



- is a discrete measure, }id = P/'^a, - We denote by D the support of f/^ and assume 
that m ^ D. 



As shown in iKempermanI (|l987t) , assumption (Al) ensures that the median m is uniquely 
defined. The second assumpti on could probably be relaxed, but it is general enough for most 
natural examples. As noted in lChaudhuril (| 19921 ), the conditions on }ic are satisfied when H = 
R**, with d > 2, whenever jAc has a bounded density on every compact subset of R"^. More 
precisely, this property is closely related to small ball probabilities since 



E 



lY- al 



P 



|Y-a|| < t 



-1 



dt. 



If P[||Y-a|| <£] < Ce'^Jor some small e and some positive constant C, it is easy to check 
that 

E ||Y-a||"^ < 00, 

whenever < fi < d. 

When H = L^{I), the dimension is not finite and sr nall ball p r obabi lities have been derived 
for some particular classes of Gaussian processes (see iNazarovl (|2009l) for a recent reference). 
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In this case, by symmetry of the distribution, the median m is equal to the mean, and many 
processes satisfy, for positive constants Ci, C2, C3, C4 which depend on the process under study. 



P [||y - m\\ <£]< Ci£^^ exp(-C2e"^3)^ 



(2) 



so that E 



lY-mir^ 



< 00, for all po s itive j 6. Similar properties of shifted small balls, for a 



close to m, can be found in lLi and Shad (|200l[) 



2.2 Some convexity and robustness properties of the median 

In this section we derive quantitative convexity bounds which will be useful in the proofs. As 
a consequence, we are a lso able to bound for the g ross sensitivity error, which is a classical 
robustness indicator (see lHuber and Ronchettil (|2009t) ). 

Recalling the definition of the median (eq. (H)), let us denote by G : H 1— > IR the function we 
would like to minimize: 

G(a):=E[||X-«||-||X||]. (3) 

This function is convex since it is a convex combination of convex functions. However, convex- 
ity is not sufficient to get the convergence of the algorithm. Under assumptions (Al) and (A2) 
this function can be decomposed in two parts: 

G{a)=XGc{oi) + {l-X)Gd{a), 

where the discrete part G(f(a) = E,- PiiWxj — dc \\ — \\x j\\) has been isolated. The first part is 
Frechet differentiable everywhere (jKempermanl (|l987l) ), so G is differentiable except on D, the 
support of the discrete part f/^f . We denote by <I> = AO^ + (1 — A)0(j its Frechet derivative. 



:= V,G 



-E 



X-a 



\X-Di\ 

Remark 1. It will he useful to define ^ on the set D. If x ED, we define Gx by "forgetting" x, 

Gxiy) = p/(||x;-y|| - ||x;||). 



This function is Frechet differentiable in x, and we let 



X — Xi 

\x — xA 



In the vocabulary of convex analysis, we have just chosen a particular subgradient ofG on the set D 
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of non-differentiability. It is easily seen that: 



yx,y, G{y)-G{x)>{^{x),y-x), 

which asserts that O is a subgradient. A short proof of inequality (H) is given in Section 
The median m is then the unique solution of the nonlinear equation, 

<I>(a) = 0. 



(4) 



(5) 



To exhibit some useful strong convexity and robustness properties of the median we need 
to introduce the Hessi an of functional G, for a G H \ D. It is denoted by Tg, maps H to H and 
it is easy to check (see iKoltchinskiil (|l997n for the multivariate case and iGervinil (|2008l ) for the 
functional one) that 



E 



X-a 



(X-a)®(X-a) 
llX-aiP 



where Ih is the identity operator in H and u ^ v{h) = {u, h) v, for u, v and h belonging to H. 
The operator is not compact but it is bounded when E || X — a < oo. 

If we define h = hi and the projection onto the orthogonal complement of h, 



{hJcch) 



\hf^ 



\hf^ 



|a-X| 
1 



[h,ci-X) 
||a-X 

2 



k-X 



k-X| 



(6) 



We can now state a strong convexity property of function al G which can be seen as an 
extension to an infinite dimensional setting of Proposition 4.1 in IKoltchinskiil (|l997l ). 



Proposition 2.1. Recall that B{Q,A) is the ball of radius A in H. Under assumptions Al and A2, 
there is a strictly positive constant Ca, such that: 

Va G B{0,A)\D,yh G H, ca < {hJcch) < Ca \\h\\^ ■ 

In other words, G is strictly convex in H and it is strongly convex on any bounded set, as 
shown in the following corollary. 

Corollary 2.2. Assume hypotheses of Proposition \2.1\ are fulfilled. For any strictly positive A, there is 
a strictly positive constant Ca such that: 



Vai,a2 G B{Q,Af, 



(0(a2) - 0(ai),a2 - ai) > Ca ||a2-*i||^- 



6 



As a particular case of Froposition l2.1l we get that there exist two strictly positive constants 

^ < 00, such that 



(7) 



As noted in iKempermanI (|l987l ), the geometric median has a 50 % breakdown point. Fur- 
thermore, an immediate consequence of (|7l is that operator r„, has a bounded inverse. Thus, 
the gross error sensitivity, which is also a classical indicator of robustness, is b ounded for th e 
median in a separable Hilbert space. Indeed, thanks to the expression derived in 
it is bounded as follows. 



Gervini 



sup 

2GH 



m 



m\ 



< 



2.3 The algorithms 

Given Xi, X2, . . . , X„, n independent copies of X, a natural estimator of m is the solution m„ of 
the empirical version of (O, 

Xi - fhn 



E 



0. 



nin 



The solution fhn is defined implicitly and is found by iterative algorithms. 

We propose now an alterr iative and simple estimation algorithm which can be seen as a 



We propose now an alterr iative ana simp j 
stochastic gradient algorithm (jRuppert 



+1 



Dufld (|l997l )) and is defined as follows 

X^+i — Z„ 



Zm 



(8) 



with a starting point that can be random and bounded, e.g. Zq = Xo1{||Xo||<m} for some positive 
constant M fixed in advance, or deterministic. If X„+i = Z„, we set Un+i = and Z„_|_i = Z„ 
so the algorithm does not move. The sequence of descent steps 7„ controls the convergence 
of the algorithm. The direction is an "estimate" of the gradient ^ of G at Z„ since the 
conditional expectation given the sequence of cr-algebra = (j{Zi, . . . ,Zn) = (j{Xi, . . . ,X„) 
satisfies 

E[iJ„+i|J-„] =0(Z„). (9) 

Note that our particular choice of subgradient O ensures that this equality always holds. 
Defining now by ^ the sequence of "errors" in these estimates. 



0(Z„) - Lf, 



n+l/ 



(10) 
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algorithm ((8]l can also be seen as a non linear Robbins-Monro algorithm. 



Z„+i = Z„ + 7„ (-0(Z„) + ^„+i) . 



(11) 



Thanks to (|9]l and (ITOl l, the sequence (^„) is a sequence of martingale differences. Let us note 
that the bracket of the associated martingale satisfies. 







•F n 


= E 






+ ||a)(Z„)f -2(0(Z„),E[LJ„+i|J-„]) 








= E 









< 1- |lO(Z„)f < 1. 

Our second algorithm consists in averaging all the estimated past values, 

1 



(12) 



Zfi+i — Z^ + 



n + 1 



z„) 



with Zo = 0, so that Zn = \ E-Li 



R emark 2. An exte nsion of the notion of quantiles in Euclidean and Hilbert spaces has been proposed 
by \Chaudhurn U99W . In such spaces, quantiles are associated to a direction and a magnitude specified 
by a vector z; G H, such that \\v\\ < 1. The geometric quantile ofX, say m° , corresponding to direction 
V and magnitude \\v\\ is defined, uniquely under previous assumptions, by 

= argminE [||X - u\\ + {X-u,v)] . 

If V = one recovers the geometric median. When \\v\\ is close to one, m" is a (directed) extreme 
quantile. In any case, m" is characterized by: 

<^y{m'') = <X>{m'')-v = 0, 

so that it can be naturally estimated with the following stochastic algorithm 

X„+i - ml 



m 



n+l =K + 7n 



as well as with its averaged version. 
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3 Convergence results 

3.1 Almost sure convergence of the stochastic gradient algorithm 

We first state the almost sure consistency of our sequence of estimators Z„ under classical and 
general assumptions on the descent steps 7^. 

Theorem 3.1. If(Al) and (Al) hold, and if (7n)nG]N satisfies the usual conditions: 

n>l n>l 

then 

lim ||Z„ — mil = 0, a.s. 



3.2 Rates of convergence and asymptotic normality 

We present now the rates of convergence of the stochastic gradient algorithm as well as the 
asymptotic distribution of its averaged version. The proofs are given in Section|6] More specific 
sequences (7^) are considered and we suppose from now on that 7„ = c^n^'^, where Cy is a 
positive constant and a G (j, !)■ We need one additional assumption to get these rates of 
convergence: 

A3. There is a positive constant A such that 

3Ca G [0,oo),V/j G i3(0,A), E [llX- (m + /i)|p^j < Ca- (13) 

This assumption is not restrictive when the dimension is strictly larger than two as discussed 
in Section IZTl 

The following proposition states that, on events of arbitrarily high probability, th e func- 



tiona l estimator Z„ attains the classical rates of convergence in quadratic mean (see jPuflo . 



19971, theorem 2.2.12) for the multivariate case) up to a logarithmic factor. 

Proposition 3.2. Assume (Al), (A2) and (A3). Then, there exist an increasing sequence of events 
(n]v)MGN/ ^nd constants C^, such that Q = Ungn i^i^d 



VN, E 



Iom \\Z„ - mf 



<CN7»ln( E7O <Cm^ 



Remark 3. An immediate consequence of Proposition \3.2\ is that 

\Zn-mf = Op' ^ ^ 
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Assumption (A3) is needed to bound the difference between G and its quadratic approxi- 
mation, in a neighborhood of m as stated in the following Lemma. 

Lemma 3.3. Assume (A3) is in force. Then, 



yheB{0,A), ^{m + h) =Ym{h) + 0(\\h 



Finally, Theorem 13.41 stated below probably gives the most important result of this work. 
It is shown that the averaged estimator Z„ and the classic static estimator fhn have the same 
asymptotic distribution. Consequently, for large sample sizes, it is possible to get, very quickly, 
estimators which are as efficient, at first order, as the slower static o ne fh„. Note that t he asymp- 



Haberman 



(Il989l ). Theorem 



totic distribution of m„ has been derived in the multivariate case byO 
6.1. For variables taking values in a Hilbert space, such asymptotic distribution has only been 
proved f or a p articular case, when the support of X is a finite dimensional space (Theorem 6 in 



Gervini 



Theorem 3.4. Assume (Al), (A2) and (A3). Then, 



^ [Zn - m) -^M (o,r„-iEr,-i) , 



with. 



Z = E 



(X - m) (X - m) 



|X — ml \\X — m\ 



Note that with 0, operator is well defined, it is bounded and positive. 



4 Illustrations on simulated and real data 



4.1 A simulation study 

A simple simulation study is performed to check the good behavior of the av eraged stochastic 
estimator and to make a comparison with the static estimator developed by IVardi and Zhang 
(2000). Two points of view are considered. The first classic one consists in evaluating the perfor- 
mances of these two different approaches for different sample sizes. The second one, which is 
the point of view that should be adopted when computation time matters, consists in compar- 
ing the acc uracy of both approaches when th e allocated computation time is fixed in advance. 
We use ^ (|R Development Core TeamI (|2010l) ) and the function spat ial . median frorn the li - 
brary ICSNP to estimate the median with the algorithm developed by lVardi and ZhangI (|2000l) . 
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For simplicity, we consider random variables taking values in M.^ and make simulations of 
Gaussian random vectors with median m = (0, 0, 0) and covariance matrix: 





(3 


2 


r = 


2 


4 




^1 


-0.5 



In order to compare the accuracy of the different algorithms, we compute the following esti- 
mation error, 

R{fh) = \\m — m\\ , (14) 

where m is an estimator of m. 

Our averaged estimator depends on the tuning parameters a and which control the de- 
scent steps 7)t = Cjk^"'. It is well known that for the particular case a = 1, the choice of param- 
eter c-y is crucial for the convergence and depends on the second derivative of G in m which is 
unknown in practice. As usually done for such procedures, we fix a = 3/4 and focus on the 
choice of c^. We also rim in parallel the algorithm for 10 initial points chosen randomly in the 
sample and then select the best estimate in which corresponds to the minimum value of 



which is the empirical version of (|3]l. 



4.1.1 Fixed sample sizes 



We perform 1000 simulations for different sample sizes, n = 250, n = 500 and n = 2000. 
Table [T] presents basic statistics for the estimation errors (fir st quartile Qi , median a nd third 



quartile Q3), according to criterion ((T4]l , for the algorithm by IVardi and Zhang! (|2000j) and our 
averaged procedure considering different values for Cy G {0.2, 0.6, 1, 2, 5, 10, 15, 25, 50, 75}. 

At first, we can note that even for moderate sample sizes the averaged procedure performs 
well in comparison with the Vardi and Zhang estimator which only gives slightly better esti- 
mations. We can also remark that the averaged stochastic estimator is not much sensitive to the 
tuning parameter Cy which can take values in the interval [2, 75] without modifying the perfor- 
mances of the estimator. As a matter of fact, we noted on simulations that interesting values 
for Cy are around or above E [||X — m||] , which is about 2.7 for this particular simulation study. 

4.1.2 Fixed computation time 

Even if both algorithms require computation times which are 0{nd) (for n observations in 
dimension d), the averaged stochastic gradient approach is much faster (on the same computer. 
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Table 1: Comparison of the estimation errors for different sample sizes 





n=250 


n=500 


n=2000 


Estimator 


roi 




031 


roi 


I.A. L 


031 


roi 




031 


Cry = 0.2 


0.45 


0.60 


0.80 


0.38 


0.53 


0.69 


0.25 


0.35 


0.47 


= 0.6 


0.21 


0.29 


0.40 


0.15 


0.21 


0.29 


0.06 


0.09 


0.12 


C-y = 1 


0.15 


0.22 


0.31 


0.11 


0.16 


0.21 


0.05 


0.08 


0.10 


C-y = 2 


0.15 


0.21 


0.30 


0.09 


0.15 


0.20 


0.05 


0.07 


0.10 


C'y 5 


0.13 


0.19 


0.25 


0.09 


0.13 


0.18 


0.04 


0.06 


0.09 


Cry = 10 


0.13 


0.18 


0.25 


0.09 


0.13 


0.18 


0.04 


0.06 


0.09 


= 15 


0.12 


0.18 


0.25 


0.09 


0.13 


0.18 


0.04 


0.06 


0.08 


Cry = 25 


0.13 


0.19 


0.26 


0.09 


0.13 


0.18 


0.04 


0.06 


0.09 


C-y = 50 


0.13 


0.19 


0.26 


0.09 


0.13 


0.18 


0.04 


0.06 


0.09 


C-y = 75 


0.14 


0.20 


0.27 


0.09 


0.14 


0.19 


0.05 


0.07 


0.09 


Vardi & Zhang 


0.12 


0.18 


0.25 


0.09 


0.12 


0.17 


0.04 


0.06 


0.08 



with procedures coded in the same ^ language). For example, in previous simulations, if the 
sample size is n = 1000, the averaged estimator is about 30 times faster. When the dimension 
gets larger the difference is even more impressive, as we wiU see in the next section. 

Let us suppose the allocated time for computation is limited and fixed in advance, say 1 
second, and comp are the sample sizes tha t can be handled by the different algorithms. The 
static estimator by IVardi and ZhangI (|2000l ) can deal with n = 150 observations, whereas our 
recursive algorithm, coded in the ® language, can take into account n = 4500 observations so 
that it will gives much better estimates of the median, as seen in Tabled] Finally, if the algorithm 
is coded in C and called from ®, then it is at least 20 times faster than its *® analogue, so that 
it can deal with at least n = 90000 observations, during the same second. 



4.2 Estimation of the median television audience profile 

The analysis of audience profiles for different channels, or different days of the year, is an 
essential tool to understand the consumers' habits as regards television. The French company 
Mediametrie provides official television audience rates in France. Mediametrie works with a 
panel of about 9000 individuals and the television sets of these individuals are equipped with 
sensors that measure the audience of the different channels at a second scale. 

A sample of around 7000 people is drawn every day in this panel and the television con- 
sumption of the people belonging to this sample is recorded every second. The data are then 
sent sequentially to Mediametrie during the night. Survey sampling techniques with unequal 
probability sampling designs are used by Mediametrie to select the sample and thus the i.i.d 
assumption is clearly not satisfied. Nevertheless, our aim is just to give an illustration of the 
ability of our averaged stochastic algorithm to deal with a large sample of very high dimen- 
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sional data. Moreover, Mediametrie has noted in these samples the presence of some atypical 
behaviors so that robust techniques may be helpful. 

We focus our study on the estimation of the television audience profile during the 6th 
September 2010. After removing from the sample people that did not watch television at all 
on that day, we finally get a sample of size n = 5423. For each element / of the sample, we have 
a vector X, G {0, l}86400^ where 86400 is the number of seconds within a day, and zero values 
correspond to seconds during the day where / is not watching television. 

A classical audience indicator is given by the mean profile, drawn in Figure [TJ which is 
simply the proportion of people watching television at every second over the considered period 
of time. We compare this classical indicator with the geometric median, whose estimation is 
drawn in black in Figure [T] We can first note that both estimators have the same shape along 
time, showing three peaks of audience during the day with higher audience rates between 
8 and 10 PM. Estimated values are smaller for the geometric median which is less sensitive 
to small perturbations and outliers. This also indicates that the distribution of the individual 
audience curves is not symmetric around the mean profile. 

From a computational point of view, even if the database is huge, it takes less than one 
minute for our algorithm to converge whereas we were not able to perform the estimation 
with the static estimator developed by IVardi and ZhangI (|2000[) because of memory contraints. 
The value of the tuning parameter was chosen to be = 400, it leads to a value of about 92 for 
the empirical loss criterion. 



5 Concluding remarks 

The experimental results confirm that averaged recursive estimators of the geometric median 
relying on stochastic gradient approaches are of particular interest when one has to deal with 
large samples of data and potential outliers. Furthermore, when the allocated computation 
time is limited and fixed in advance and the data arrive online these techniques can deal, in 
a recursive way, with larger sample sizes and finally provide estimations that are much more 
accurate than static estimation procedures. We have also noted in the simulation experiment 
that they are not very sensitive to the value of the tuning parameter c^. 

One could imagine many directions for future research that certainly deserve further atten- 
tion. Taking advantage of the rapidity of our estimation procedure, one could use resampling 
techniques, similar to the bootstrap, in order to approximate the asymptotic distribution of 
the estimator given in Theorem 13.41 and then build pointwise confidence intervals. Proving 
rigorously the validity of such techniques is far beyond the scope of this paper. 

Our procedure ca n also be extended readily for online clustering, adapting the well known 



MacQueen algorithm (|MacOueenl (|l967l )) to the Li context. Even if the criterion to be optimized 



is not convex anymore, it can be proved that stochastic gradient approaches converge almost 
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surely to the set of stationary points (|Cardot et al.l (|2010bl )) and thus are interesting candidates 
for online clustering. 

Another direction of interest is online estimation of the conditional geometric median when 
real covariates are available. For instance, the age or the size of the city where individual live 
are known by Mediametrie and it can be possible to take such information into account in order 
to get varying time regression models that can also be estimated in a very fast way thanks to 
sequential approaches. 

Finally, as noted in Section |4!2] the independence condition is rarely satisfied for real stud- 
ies. A direction that deserves further investigation is to determine under which dependence 
conditions our results, such as Theorems 13. 1 1 and 13.41 remain true. 



6 Proofs 

6.1 Convexity — Proofs of Proposition 12.11 and Corollary 12.21 

We first show that O is a subgradient of G. For points x ^ D, this is clear since G is Frechet 
differentiable. 

Pick a point in D and recall that O^(xo) = YLi,Xi^xo Vi ||xo-j'|| • have. 



(Orf(xo),y-xo) = ^ Pi 



I ^0 - Xi 



^ {XQ-Xi,y-Xi) ^ 

hXii^XQ II U Ml !,X,7tXo 

< Gi{y) - Grf(xo), 



so that Orf is a subgradient of G^. 

The upper bound in Proposition l2.1l follows immediately from (|6]l and the assumption (A2). 
For the lower bound, thanks to we only need to prove: 



Va G i3(0,A),VM, ||m|| = 1, 



{u,Tau) = E 



\PuiX-oc) 
IIX- all^ 



(15) 



where P„ is the projection on the orthogonal of u. This quantity is small when X — a is in 
span(i;). 

Recall that (by (Al)), X is not supported on a line. Consider the set of subspaces K C H 
satisfying: Vx G K, Var ( (x, X) ) =0. Suppose that this set is non-empty, and let H' be a maximal 
element in it (this exists by Zorn's lemma). The orthogonal of H' has at least dimension 2 
(otherwise, we get a contradiction to Al). Let vi,V2 be two orthogonal vectors in H'^. Let 
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Vt = cos(f)yi + sin(f)z;2- The map 

t ^ Var((ut,X)) 

is continuous on a compact set. Its minimum cannot be zero (since this would contradict the 
maximality of H'). Therefore there exists a c such that, for all unit v in the plane spanned by 
{v^,V2),Var{{X,v)) > c. 

The orthogonal of u (an hyperplane) and the (2-dimensional) plane spanned by vi and V2 
necessarily intersect: there exists a unit vector v G span(z7i, ^2) such that {u, v) = 0. Therefore, 
for all 1/ G H, ||P„yf > {y,vf. In particular, ||P„(X-a)f > (X-a,y)^ 

Suppose first that X is a.s. bounded by K. Then 



{v,X-a.f 
llX-all^ 



[v, X - a) 



It is easily seen that the last term is bounded below by Var ( (y, X) ) > c and ([TSl l holds with 

1 

To get rid of the boundedness assumption on X, we can just choose K large enough so that 
Var((y, X1||x||<_k)) is strictly positive for v = V\,Vr. 

The corollary is a consequence of Proposition 12.11 and of the fact that ^ is a subgradient. 
Indeed, the inequality holds for by interpolation: for an elementary proof, define oli = 
(1 — t)a.i + ta2, and write 0(^2) — = Jq f'{t)dt where /(f) = 0(at). One can then apply 

Gill, t by t, with a = at and " - "^'"^ 
Moreover, 



a2— ai 



(^^(^2) - ^rf(ai),a2 - Oil) = (Orf(a2),a2 - ai) + (^rf(ai),ai - ^2) 

> G(a2) - G(ai) + G(ai) - 0(^2) 
= 0, 

where the second line follows from (|4]|. Since = AOc + (1 — A)Orf, Corollary I2.2l is proved. 
6.2 Proof of Theorem |33l 

The proof of Theorem 13.11 follows a classical strategy and consists of two steps. 

Lemma 6.1. Under the hypotheses ofTheorem \3.1\ there is a random variable V such that, E [| Vp] < 
00, and 

lim \\ Zn — m |P = V, a.s. 



15 



Proof of Lemma 16.11 Let us consider V„ := ||Z„ — m||^. Recall that Z„+i = Z„ — 7„0(Z„) + 
7n?n+i (cf. m)- Therefore 

Vn+\ = \\Zn -m- 7„<I>(Z„)f + 7^, + 27„ (^„+i,Z„ - 7„^(Z„)) ■ 

If we condition with respect to Tn, the last term disappears since (^„) is a martingale difference 
sequence and it comes: 



E [Vn+i\Tn] = ||Z„ - m - 7nO(Z„)f + 7^E [|j^„+if 

= \\Zn - mf - 27„ (Z„ - m,0(Z„)) + 7^ (||^(Z„; 
= y„ - 27,, (Z„ - m, ^(Z,0) + ll, 



■E 



(16) 



where we used the definition of y„ and ([12)) for the last term. Since G is convex, using Corollary 
12.21 we get: 

(Z„ - mMZn)) = (Z„ - m,0(Z„) - > 0. 

Therefor e, for all n, E [V^+ilJ-",;] < V„ + 7^. From the Robbins Siegmund theorem (see for 
instance (|Dufld, 119971, page 18)), we deduce that {V„) converges almost surely to V. Moreover, 
we note that Z„ — ra is bounded in L^, 



Vn, V„ = E 



m\ 



< E 



iZn - ml 



E 7? < 



(17) 



k=l 



whenever E 
00. 



jZo - ra| 



< 00, which is satisfied for example if Zq = Xo1|||Xo||<m}' with M < 

□ 



We can now give the proof the theorem. 

Proof of Theorem |3J] Lemma [6.11 shows that the sequence V„ converges almost surely. Let us 
check now that its limit is zero. Let us take expectations in equation ([T6l l: 

E [Vn+i] = E [V„] IjnlE [(0(Z„), Z„ - m)] 

= E [\/o] + E 7' - 2 E T/cE [{^(Zk), Z, - m)] . 

k=l k^l 

The sequence YJl^i Ifk^ [{^{^k)/ Zk ~ has positive terms, and is bounded above by E [Vq] + 
Ya^i 7k' therefore it converges. This implies in particular that 



E In {^{Zn),Zn - m) < +00 a.s. 



(18) 
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This convergence cannot happen unless Z„ converges to m. Indeed, for each e e]0, 1[, let us 
introduce the set 

= {to E CI: 3ne{co) > l,Vn > ne{to), < V„{co) < e'^] . 
For CO G Cle, we have with Corollary 12.21 

^ 7„ (0(Z„(a;)),Z„(a;) -m) > ^ 7J inf - m) = 00, 

n>l \n>n,{w) J e<\\^->^'\\<>^-' 

which contradicts ([TSl l unless P(ne) = 0. Since V„ converges a.s. to a finite limit, and {lim V„ G 
[c, c^^] } C dc/2f the only possible limit is zero: 

lim ||Z„ — m|| = 0, a.s. □ 
6.3 Proof of Lemma |33] and Proposition K2\ 

Proof of Lemma [3Jl Consider, for h G 6(0, A), the function = + th), defined for 

t G [0, 1]. We have /;,(0) = ^(m) = and /;,(1) = 0(m + h). It is also clear that the first order 
derivative of function /;, satisfies /^(O = ^m+th- Consequently, a Taylor expansion with 
integral remainder of about t = gives us 



^{m + h)= + r T^^thih) dt. 

Jo 



By Lemma 5.7 in lChaudhuril (|l992l ), there is a constant such that for all f G [0, 1], 

\\^m+th - Tm II L < \\h\\ 

where || . is the usual norm for bounded linear operators. Since 0(m) = 0, one gets 



\^{m + h) -T,„{h)\\ < sup ||r„+f;, - r„||L ||/7|| < ||/i 

fG[0,l] 



l2 



and this concludes the proof. □ 
Proof of Proposition 13.21 The proof is composed of 5 steps. 
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Step 1 — a spectral decomposition. Recall that Tm is: 

1 {X - m) (g) {X - m) 



E 

E 

E 



\X — ml 



\X — m 



\X-m\ 



1-1 



Ih-E 



|X-m||2 

1 /(X- m) (g) (X- m) 



X — m 



|X-m||2 



(19) 



Since r„, is bounded and symmetric, it is self-adjoint. Moreover, the operator A„, defined by 
([T9l l is trace class: it is self-adjoint, non negative, and if (e^ ) is an orthonormal basis. 



< E 



[X-m,ejy 
llX-m||^ 



X-m 



< 00. 



Therefore A,„ is compact, and there is an increasing sequence of eigenvalues (Ay), with 
possible repetitions, and an orthonormal basis {vj) of eigenvectors in H such that: 



V; G N, YmVj = AjVj, 

a{Y,n) = {Ay,;GN}u|E 



\X — m 



1-1 



\j > E 



IX-m 



1-1 



Moreover, thanks to 0, the smallest eigenvalue A„„„ of is strictly positive. For simplicity of 
notation, we rewrite this decomposition as follows, 

TmX = ^ {ex, x) ex, x E H, 
AeA 

where A is the multiset {Ay,y G N}, that can account for eigenspaces of dimension larger than 
1. 

In the following, we will need the operators: 



^Ic = Ih - 7k^m, 



/3„ = ■■ - Oil. 



Since F^ is bounded, these operators are well defined. Introducing the sequence of real func- 
tions, for n G N, 

fn{x)=U{^-lkX), 



we see that /n ( • ) and /„ ^ ( • ) are well defined on cr(r^ ), provided 7,;E 



IX- ml 



< 1, which 
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we can assume without loss of generality. Elementary analysis shows that there exist constants 
Ci, C2, C3 such that: 

Vx e cr(r,„), ciexp {-s„x) < fn{x) < C2 exp {-s„x) , 

(20) 

< C3, 



^7 



1 — a. 

where we recall that s,, = Y2=i "Yk, arid = c^k^'^. Then each operator /3„ can be also expressed 
as follows: 

^„x = ^ /„(A) (eA,x) X e H, 

aga 

their inverses are bounded operators, and satisfy: j6^^x = Y^a^a/h^W {^a, ^) 

Step 2 — Decomposition of the algorithm. Let us rewrite the algorithm in the following way 

= Zn + " 7n(rm(Z„ - m) + 

where = 0(Z„) — r„,(Z,i — is the difference between the gradient of G and the gradient 
of its quadratic approximation. Therefore: 

\/k, Zk+i-m = iXk{Zk-m)+jk^k+i-7A (21) 
Rewriting a„_ia„_2 • • • j6„_ij6^^, we get by induction, 

Z„ - m = ^n-l (Zl - m) + ^n-lMn - ^n-lRn-lr (22) 

where 

)c=l lc=l 

The first two terms of (|22|) are what we would get if G was exactly quadratic: a deterministic 
gradient part going to m, and a noise part; R,; is the error term. We will look at each of these 
terms in turn. 

Step 3 — The deterministic term. We want to boimd /3„_i(Zi — m). The asymptotic be- 
haviour of fn in eq. (|20l l implies that 

< C2exp ( 
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where A„„„ > is the smallest eigenvalue of r,„. Therefore 



E 



||/3„_i(Zi-/«)f <Cexp -2ni-ME 



1 Zi —ml 



(23) 



Step 4 — The martingale. The fact that the /S^ are operators (instead of real numbers) makes 
matters more complicated. To deal with this problem, we use the spectral decomposition of the 
sequence of self-adjoint operators {jik)- 

More precisely, we decompose M„ = I^aga (^A/ M,,) = Ea ^n^A- For each A G A, M;^ is 
a martingale, and 



k<n-l 



since E [(^)t'/ ^a) (^it+i/ ^a) l-^it] = when /c' < A: + 1. Summing now over A G A, we get: 



E 



\\Pn-lMnt\ =X:/,il(A)E [(M^) 

A 



A )c<n-l V /'^V^^ 



E 



(^Jt+i/eA)' 



However, for any k, n, and any A G A, 



/f!-l(A) j-j ^-[^ _ ^ fn-l{Knin) 



hi^) Ail 



/fc(A„ 



This uniformity in A allows us to reconstruct E 
(IE). We obtain: 



, which is bounded by 1, thanks to 



E 



\\^n-lMn\ 



< E '^/c S-rr^) E^ (w^'a) 

k<n-l \ JkV^min) 



^ ^ ^2 / fn-l{^min) 
k<n-l V fki^min) 

k<n-l ^ fki^min) 



A 



E 



Now we use the bounds (|20]l on j6, 



E 



||/3„-iM„fj <^ E 7.'exp(- f: 7; 

''1 ;c<;i-l V i=k+l , 



£C E 7fexp(-^(„.-.-*.-«)) 

k<n—l \ / 



(24) 



20 



The exponential terms are very small when k is much smaller than n, therefore we isolate the 
last terms. To do that, we choose l{n) such that. 



(25) 



with Ca to be chosen later. The first part of the sum (|24)) (for k < l{n)) gives us: 

k<l{n) 



E Tii-exp -- 

k<l{n) ^ 



k'-')) < L 7.^exp(--^ln(n)) 



(26) 



This can be made smaller than any prescribed inverse power of n, if we choose large enough. 
In the second part of the sum (|24)), for k > l{n), we bound the exponential by 1 and by 7/(n): 

E 7^exp(--l-(ni--/ci-)) <(n-/(n))7?(„). 

k>l(n) ^ ^ 

The number of terms n — Z(n) is equivalent to ln(n)n'*, and Ji^n) ~ c^n^". Therefore, the 
whole second term is equivalent to cln(n)n^'', where c depends on Ca and Cy. For Ca large 
enough, this dominates the first term (|26)). Finally we get: 



\\^n-lM„\ 



< C 



ln(n) 



(27) 



Step 5 — the error term and the conclusion. The error term is f^n-i ILk=i 7kf^k where 
Sk = ^{Zk) - T,n{Zk - m). With Lemma |331 we get that 



3r, Q V/c, llZjt — mil < r 



(28) 



Since Z„ converges a.s. to m, we deduce two things about 3]^: it is almost surely bounded, and 
((28) becomes a.s. eventually true. To use these facts we introduce the following sequence of 
events: 

Vn > N,Vfc > n - /(n), \\Z,,{a)) - m\\ < 1/K ' 
and ||(5/f(a;)|| < C,- ||Zj.(a;) — m\\ 



N 



CO, 



yk,\\5kicv)\\ < N. 

for a value of K to be chosen later, and l{n) defined by (|25l l. This sequence is increasing and 
U Q]v = O; from now on we work on 0^. 

Once more, since /3„_i/6^^ is very small when k is much smaller than n, only the last terms 
in the sum defining Rn matter. This is why we re-use the definition of l{n) and cut the sum in 
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two parts. For co G On, and n > N, 



//(«) 










/l{n) 






\i-=1 





^;c=/()3)+i 



C2 
'—I 



;c=/()3)+i 



< 



where we used the crude bound \\Si^\\ < N in the first part, and for the second part, /3„_ij6 
1 and the definition of On- 

As before, it is easy to see that the first term is bounded by any prescribed inverse power of 
n, say n^'^^. For the second term, we already know that (n — Z(w))7;(„) is bounded. Therefore, 
on d]s} and for n > N, 



k=l{n)+l 

Combining now (|22ll , (|23l) , ^7) and (|29l l, we get, for n > N and some new constant C 



(29) 



E 



ln„ \\Z„-m\ 



^ Cln(n) C ^ ^ 

/c=!(n)+l 

Cln(n) C 

" ^ l{n)<k<n 



Ifi^ \\Zk-m\ 



Iom ||Z/c-m| 



Let us choose K such that > 2C'. Then 

Cln(n) 1 



Vn > N, E 



lo^, ||Z„ - ml 



< 



H — max E 

2 l{n)<k<n 



Defining m„ = E 



Iom - ml 



, this reads: 



Cln(n) 1 
Vn > N, u„ < H — max uj-. 

2 l{n)<k<n 



(30) 



Let us prove by induction that, for some N' large enough, and for C" = AC, 



Vn > N', M„ < 



C"ln(n) 



Suppose that W/^ < for all k < n, and let us prove that m„+i < ^-^^i^^jyp-- Using (|30l l, we 
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know that: 



^ Cln(n + 1) , 1 
M„ 1 1 < — -, r — + - max lu 

- (n + l)« 2 /(„+i)<jc<„+i ^ 

Cln(n + 1) 1 / C"ln(/(n + l)) 

< / , -N. + max m„+i. 



If the max on the right hand side is m^+i, we get: 

1 ^Cln(n + 1) 



2 (n + l)« 

which is the desired result since 2C < 4C = C". If this is not the case, then the max is 
qg^. However, \{n + 1) ^ (n + 1) so for n larger than some N', ifg^^ < fig^. 
Hence 

Cln(n + 1) 3 Cnn(n + l) 
""+1 ^ (n + l)'^ + 4 (n + l)« 

(n + l)« {n^xy 
This concludes the induction step and the proof of Proposition |3.2| □ 

6.4 Proof of Theorem iMl 



We use the same decomposition as in lPelletieii (|200d ). It consists in linearizing the target func 



tion O around the true value m. Recall the following decomposition of the error (|2T|), 

V/c, Z,t+i - m = (Ih - 7;crm) [^k -m)+ -fk^k+i - Ikh, 

where ^k is a martingale difference sequence and 5k are error terms, 5k = ^(Z/J — Tm{Zk — m). 
Defining now, 

n 

Tn:=Zn-m, Tn:=Zn-m and Mn+i := Y^^k+i> 

k=l 

and rearranging the previous expression, we obtain: 

1 

7fc 



Summing these equalities, it comes. 



"1 n ^ 

2r„,T„ = ^ — (T,, - T^+i) -Yj^k + Mn+l. 

k=l A:=i 
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Applying Abel's transform, and dividing by i/n yields: 



Vn^niTn = ^ ( — - + I] Tic 



n V 7i In 



7k 7k-i 



" \ 1 ~ 



To prove that last term is a martingale for which the CLT holds. 



M„ 



>Ar(0,E), 



we need to check that the assumptions of Theorem 5.1 in (|Takubowski Il988l ) are fulfilled. We 
first have that the martingale difference sequence is a.s. bounded, Vn || ^„ || < 2. Let us define 

which can also be decomposed as follows 

(X-Z„) ^ (X-Z„ 



Z„ = E 



X — z« 



X — z« 



0(Z„)®0(Z„). 



Since <!>(?«) =0, we have by a direct computation, 

|0(Z„)||<E^ ^ 



X-m 



|Z„ — mil . 



Using now, for (fl, &) S H x H, the inequality ||fl (X) fcH^ < ||&|| , where ||fl is the usual 

the norm for linear operators, we directly get, with Theorem 13.11 

||<D(Z„)0^(Z„)||l^O, fl.s. 

With similar arguments, it is easy to show that 



E-E 



(X-Z„) ^ (X-Z„) 



X — z„ 



X — Zm 







< 2E 








L 





(X-Z„) (X-m) 



X-ZJ x-m 



< 4E 



IX -ml 



m 



so that II — Z II — > fl.s., when n tends to infinity Then condition 5.2 in (|Takubowskil,ll988l) is 
satisfied and is a consequence of a direct application of Chow's Lemma, see for instance jPuflo , 
19971. page 22). 

Now, it remains to prove that 



1 / T„ 



+1 



n \ In 



k=2 



1 



Ik lk-1 



k=i , 



(31) 
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IOm \\Tn\ 



< Cn^^^, thanks to Proposition 13.21 For the first term An 



Let us denote by A„, A'„ and A[[ the three terms. 
Recall that E 

T 

, we have: 



E 



IOm II ^f! I 



Therefore A„ > 0. 

n^oo 



Let us turn to the second term A'„. Since 7]^^ — ^ 2ac^ ^fc** \ we have, for two positive 
constants Q, Ci, 



EE[lnJ|r,||]fc-i 



< 



Co 



k<n 



,a/2-l 



< Ci\/ln(n)n 



a/2-1/2 



which eoes to zero since a < 1. Therefore A' > 0. 

Finally, for the last term A", since there exists a positive constant C2 such that ||^;c|j < 

C2 ||Z;t ~ '^ll / we have: 



E[inji<ii] <^i:e 



lOw ||T\| 



< ^ E Mm-"- 



k<n 

Since the right hand side term converges to zero (as can be seen e.g. by Kronecker's lemma, 
using the fact that a > 1/2), C„ — > 0, therefore ((31) holds, and Theorem |3.4| is finally proved. 

" n^oo 
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Figure 1: Estimations of the mean and the geometric median audiences, at a second scale, 
during the 6th September 2010. 
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